Exploratory data analysis¶

In [1]:
# Libraries
import pandas as pd
import numpy as np
import janitor
from pandas_profiling import ProfileReport
import plotly.express as px
from helpers import bar_plotter, merger, greenspace_plotter, df_std
In [2]:
# Read in distance to green spaces
distance = pd.read_csv("data/green_spaces.csv").clean_names()

# Read in neighbourhood ratings
neighbourhood = pd.read_csv("data/neighbourhood_rating.csv").clean_names()

# Read in community ratings
community = pd.read_csv("data/community_belonging.csv").clean_names()
In [3]:
distance
Out[3]:
featurecode datecode measurement units value distance_to_nearest_green_or_blue_space age gender urban_rural_classification simd_quintiles type_of_tenure household_type ethnicity
0 S12000026 2013 95% Lower Confidence Limit, Percent Percent Of Adults 71.0 A 5 minute walk or less All All All All All All All
1 S12000045 2017 Percent Percent Of Adults 59.0 A 5 minute walk or less All All All All All Pensioners All
2 S12000026 2014 95% Upper Confidence Limit, Percent Percent Of Adults 86.9 A 5 minute walk or less All All All All All All All
3 S12000026 2017 95% Upper Confidence Limit, Percent Percent Of Adults 80.9 A 5 minute walk or less All All All All All All All
4 S12000026 2017 95% Upper Confidence Limit, Percent Percent Of Adults 79.6 A 5 minute walk or less All All All All All Pensioners All
... ... ... ... ... ... ... ... ... ... ... ... ... ...
38446 S92000003 2018 Percent Percent Of Adults 26.0 Within a 6-10 minute walk All All All All All All Other
38447 S92000003 2018 95% Lower Confidence Limit, Percent Percent Of Adults 20.4 Within a 6-10 minute walk All All All All All All Other
38448 S92000003 2019 95% Upper Confidence Limit, Percent Percent Of Adults 7.8 Don't Know All All All All All All Other
38449 S92000003 2014 95% Lower Confidence Limit, Percent Percent Of Adults 15.8 Within a 6-10 minute walk All All All All All All Other
38450 S12000036 2018 95% Upper Confidence Limit, Percent Percent Of Adults 36.1 Within a 6-10 minute walk All All All All All All Other

38451 rows × 13 columns

Each dataset contains the percentage of adults in each category of the variable of interest (in this case distance_to_nearest_green_or_blue_space), broken down by one other variable at a time while the remaining variables are held constant (signified by the value All).
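As a minimal sketch of this layout, the toy frame below (loosely based on rows shown in Out[3], with most columns omitted) picks out the rows that break the distance variable down by a single other variable. The helper name `breakdown_rows` is an assumption made for illustration and is not part of `helpers`.

```python
import pandas as pd

# Toy subset of the `distance` columns, loosely based on rows in Out[3]
toy = pd.DataFrame({
    "value": [71.0, 59.0, 26.0],
    "distance_to_nearest_green_or_blue_space": [
        "A 5 minute walk or less",
        "A 5 minute walk or less",
        "Within a 6-10 minute walk",
    ],
    "household_type": ["All", "Pensioners", "All"],
    "ethnicity": ["All", "All", "Other"],
})


def breakdown_rows(df, column, other_columns):
    """Return rows split by `column` while every other breakdown column is 'All'.

    Hypothetical helper, for illustration only.
    """
    others_all = (df[other_columns] == "All").all(axis=1)
    return df[(df[column] != "All") & others_all]


by_household = breakdown_rows(toy, "household_type", ["ethnicity"])
by_ethnicity = breakdown_rows(toy, "ethnicity", ["household_type"])
print(by_household)
```

Rows where every breakdown column is All (the first toy row) are the overall figures and are returned by neither call.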

In [4]:
distance.distance_to_nearest_green_or_blue_space.value_counts()
Out[4]:
A 5 minute walk or less      10377
Within a 6-10 minute walk    10377
An 11 minute walk or more    10350
Don't Know                    7347
Name: distance_to_nearest_green_or_blue_space, dtype: int64
In [5]:
community.walking_distance_to_nearest_greenspace.value_counts()
Out[5]:
All                     39375
Less than 10 minutes     3303
More than 10 minutes      828
Don't Know                105
Name: walking_distance_to_nearest_greenspace, dtype: int64
In [6]:
neighbourhood.neighbourhood_rating.value_counts()
Out[6]:
Very good      9564
Fairly good    9564
Fairly poor    8781
Very poor      6828
No opinion     3318
Name: neighbourhood_rating, dtype: int64
In [7]:
community.community_belonging.value_counts()
Out[7]:
Very strongly          9564
Fairly strongly        9564
Not very strongly      9564
Not at all strongly    9186
Don't know             5733
Name: community_belonging, dtype: int64

The above are the variables of interest. All datasets contain a variable that represents walking distance to a green space, but the first one is binned differently and with useful extra granularity, as it records walks of 5 minutes or less and walks of 6-10 minutes separately. The other two datasets record only whether the distance is more than or less than 10 minutes. Moreover, some of these records are contradictory (e.g. a percentage of people reporting being both less than and more than 10 minutes away from green spaces). For these reasons, walking_distance_to_nearest_greenspace from community and neighbourhood will be kept before joining the three datasets, then re-binned after the join.
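To make the two binning schemes concrete, here is a rough sketch of how the four fine-grained bins line up with the coarser ones. The mapping is an assumption for illustration (strictly, "Within a 6-10 minute walk" includes walks of exactly 10 minutes), and it is not necessarily how merger() re-bins the data.

```python
import pandas as pd

# Assumed correspondence between the fine bins (distance dataset)
# and the coarse bins (community and neighbourhood datasets)
COARSE_BINS = {
    "A 5 minute walk or less": "Less than 10 minutes",
    "Within a 6-10 minute walk": "Less than 10 minutes",
    "An 11 minute walk or more": "More than 10 minutes",
    "Don't Know": "Don't Know",
}

fine = pd.Series([
    "A 5 minute walk or less",
    "Within a 6-10 minute walk",
    "An 11 minute walk or more",
    "Don't Know",
])

# Series.map looks each label up in the dictionary
coarse = fine.map(COARSE_BINS)
print(coarse.value_counts())
```

Going the other way (coarse to fine) is not possible without extra information, which is why the extra granularity of the first dataset is worth preserving.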

Lastly, regional values will be filtered after the merge, as we are considering Scotland as a whole for the analysis questions, and other variables (e.g. urban or rural) can be assumed to contain at least some of the regional information.

Preparing data: creating a single dataframe¶

The custom function merger() below joins the three datasets together in preparation for further analysis.

In [8]:
survey = merger(distance, community, neighbourhood)
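The implementation of merger() is not shown here. As a hedged sketch of one way such a join could work, the hypothetical function `stack_surveys` below stacks frames onto a shared schema and fills any breakdown column a frame lacks with All. The toy values are taken from rows in the sample output of the merged dataframe; the column handling is an assumption.

```python
import pandas as pd


def stack_surveys(frames, breakdown_columns):
    """Stack frames on a shared schema, filling absent breakdown columns with 'All'.

    Hypothetical stand-in for illustration; not the actual merger() helper.
    """
    aligned = []
    for frame in frames:
        frame = frame.copy()
        for column in breakdown_columns:
            if column not in frame.columns:
                frame[column] = "All"
        aligned.append(frame)
    return pd.concat(aligned, ignore_index=True)


distance_toy = pd.DataFrame({
    "year": [2017],
    "percent_adults": [64.0],
    "nearest_green_space": ["A 5 minute walk or less"],
})
community_toy = pd.DataFrame({
    "year": [2017],
    "percent_adults": [38.0],
    "community_belonging": ["Fairly strongly"],
})

stacked = stack_surveys(
    [distance_toy, community_toy],
    ["nearest_green_space", "community_belonging"],
)
print(stacked)
```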
In [9]:
# These are the resulting bins and counts for the new variable
survey.nearest_green_space.value_counts()
Out[9]:
All                          24235
Within a 6-10 minute walk     5379
An 11 minute walk or more     3980
A 5 minute walk or less       3460
Don't Know                      61
Name: nearest_green_space, dtype: int64
In [10]:
# Filtering out regional values for analysis
survey = survey.query("featurecode == 'S92000003'").drop('featurecode', axis=1)
In [11]:
survey.sample(10)
Out[11]:
year percent_adults nearest_green_space age gender urban_rural_classification simd_quintiles type_of_tenure household_type ethnicity community_belonging neighbourhood_rating
50001 2017 38.0 All All All Rural All All All All Fairly strongly All
9299 2017 64.0 A 5 minute walk or less All All All All All Adults All All All
213 2018 12.0 An 11 minute walk or more All All Urban All All All All All All
33632 2014 18.0 Within a 6-10 minute walk All All All All All Adults All All All
107271 2018 55.0 All All All All All Other All All All Very good
15027 2018 3.0 All All All All All Social Rented All All All All
113325 2013 9.0 All All All All All Social Rented All All All Fairly poor
13873 2017 15.0 An 11 minute walk or more All All All All Other All All All All
38852 2014 69.0 A 5 minute walk or less All All All All All All White All All
32962 2016 21.0 Within a 6-10 minute walk 16-34 years All All All All All All All All

Analysis questions¶

In [12]:
survey.query("nearest_green_space == 'An 11 minute walk or more' & type_of_tenure == 'Owned Outright'").sort_values('year')
Out[12]:
year percent_adults nearest_green_space age gender urban_rural_classification simd_quintiles type_of_tenure household_type ethnicity community_belonging neighbourhood_rating
10994 2013 13.0 An 11 minute walk or more All All All All Owned Outright All All All All
10932 2014 12.0 An 11 minute walk or more All All All All Owned Outright All All All All
11531 2015 13.0 An 11 minute walk or more All All All All Owned Outright All All All All
12333 2016 13.0 An 11 minute walk or more All All All All Owned Outright All All All All
12325 2017 15.0 An 11 minute walk or more All All All All Owned Outright All All All All
11849 2018 14.0 An 11 minute walk or more All All All All Owned Outright All All All All
12822 2019 14.0 An 11 minute walk or more All All All All Owned Outright All All All All

The above is an example query of the dataframe. It shows the average percentage of adults who are property owners and live further than a 10 minute walk from the nearest green space across the recorded years 2013-2019. In the analysis questions below, survey will be similarly queried and numeric results will come from the percent_adults column.
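For readers less familiar with DataFrame.query, the sketch below shows that the query string is equivalent to an ordinary boolean mask. The toy frame reuses a few values from the outputs in this notebook and omits most columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2013, 2014, 2013],
    "percent_adults": [13.0, 12.0, 71.0],
    "nearest_green_space": [
        "An 11 minute walk or more",
        "An 11 minute walk or more",
        "A 5 minute walk or less",
    ],
    "type_of_tenure": ["Owned Outright", "Owned Outright", "Owned Mortgage/Loan"],
})

# String form, as used throughout this notebook
queried = toy.query(
    "nearest_green_space == 'An 11 minute walk or more' & type_of_tenure == 'Owned Outright'"
)

# Equivalent boolean mask; the parentheses are required here
mask = (toy["nearest_green_space"] == "An 11 minute walk or more") & (
    toy["type_of_tenure"] == "Owned Outright"
)
masked = toy[mask]

print(queried.sort_values("year"))
```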

In [13]:
ProfileReport(survey)
Summarize dataset: 100%|██████████████████████████| 35/35 [00:02<00:00, 17.19it/s, Completed]
Generate report structure: 100%|███████████████████████████████| 1/1 [00:01<00:00,  1.25s/it]
Render HTML: 100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  3.03it/s]
Out[13]:

Are there certain groups that have local access to green space?¶

By 'local access' only reports of a 5 minute walk or less will be considered.

In [13]:
(survey.query("nearest_green_space == 'A 5 minute walk or less'")
 .sort_values('percent_adults', ascending=False)
).head(20)
Out[13]:
year percent_adults nearest_green_space age gender urban_rural_classification simd_quintiles type_of_tenure household_type ethnicity community_belonging neighbourhood_rating
17741 2015 80.0 A 5 minute walk or less All All Rural All All All All All All
19850 2014 79.0 A 5 minute walk or less All All Rural All All All All All All
16161 2018 77.0 A 5 minute walk or less All All Rural All All All All All All
4713 2017 75.0 A 5 minute walk or less All All Rural All All All All All All
16127 2016 75.0 A 5 minute walk or less All All Rural All All All All All All
16769 2013 75.0 A 5 minute walk or less All All Rural All All All All All All
4763 2019 73.0 A 5 minute walk or less All All Rural All All All All All All
20451 2014 72.0 A 5 minute walk or less All All All All Owned Mortgage/Loan All All All All
8852 2014 72.0 A 5 minute walk or less All All All All All With Children All All All
19360 2018 72.0 A 5 minute walk or less All All All All Owned Mortgage/Loan All All All All
1904 2015 72.0 A 5 minute walk or less All All All All All With Children All All All
20417 2015 72.0 A 5 minute walk or less All All All All Owned Mortgage/Loan All All All All
10405 2014 71.0 A 5 minute walk or less 35-64 years All All All All All All All All
20507 2013 71.0 A 5 minute walk or less All All All All Owned Mortgage/Loan All All All All
8752 2014 71.0 A 5 minute walk or less All Male All All All All All All All
9464 2018 71.0 A 5 minute walk or less All All All All All With Children All All All
10404 2013 71.0 A 5 minute walk or less 16-34 years All All All All All All All All
24180 2014 70.0 A 5 minute walk or less All All All 80% least deprived All All All All All
9498 2015 70.0 A 5 minute walk or less 35-64 years All All All All All All All All
20380 2017 70.0 A 5 minute walk or less All All All All Owned Mortgage/Loan All All All All

Overall, we see that people from rural households consistently reported short distances, followed by households with a mortgage or with children. This does not show whether the percentage of people in complementary classifications (e.g. urban households, rented households or households with no children) is significantly lower.

In [14]:
df = (survey.query(
    "nearest_green_space == 'A 5 minute walk or less' & urban_rural_classification != 'All'")
      .sort_values('year')
     )

df.head(10)
Out[14]:
year percent_adults nearest_green_space age gender urban_rural_classification simd_quintiles type_of_tenure household_type ethnicity community_belonging neighbourhood_rating
3181 2013 66.0 A 5 minute walk or less All All Urban All All All All All All
16769 2013 75.0 A 5 minute walk or less All All Rural All All All All All All
2902 2014 66.0 A 5 minute walk or less All All Urban All All All All All All
19850 2014 79.0 A 5 minute walk or less All All Rural All All All All All All
3085 2015 64.0 A 5 minute walk or less All All Urban All All All All All All
17741 2015 80.0 A 5 minute walk or less All All Rural All All All All All All
1131 2016 63.0 A 5 minute walk or less All All Urban All All All All All All
16127 2016 75.0 A 5 minute walk or less All All Rural All All All All All All
851 2017 63.0 A 5 minute walk or less All All Urban All All All All All All
4713 2017 75.0 A 5 minute walk or less All All Rural All All All All All All
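The urban and rural rows above can be put side by side to read off the gap for each year. This is a small sketch using only the ten rows displayed in Out[14] (2013-2017); the column name `gap` is introduced here for illustration.

```python
import pandas as pd

# The ten rows displayed in Out[14], reduced to the relevant columns
rows = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014, 2015, 2015, 2016, 2016, 2017, 2017],
    "urban_rural_classification": ["Urban", "Rural"] * 5,
    "percent_adults": [66.0, 75.0, 66.0, 79.0, 64.0, 80.0, 63.0, 75.0, 63.0, 75.0],
})

# One row per year, one column per classification
wide = rows.pivot(
    index="year",
    columns="urban_rural_classification",
    values="percent_adults",
)
wide["gap"] = wide["Rural"] - wide["Urban"]
print(wide)
```

In these rows the rural figure is higher than the urban one in every year, by between 9 and 16 percentage points.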
In [15]:
df = (survey.query(
    "nearest_green_space == 'A 5 minute walk or less' & type_of_tenure != 'All'")
      .sort_values('year')
     )

df.head(10)
Out[15]:
year percent_adults nearest_green_space age gender urban_rural_classification simd_quintiles type_of_tenure household_type ethnicity community_belonging neighbourhood_rating
20507 2013 71.0 A 5 minute walk or less All All All All Owned Mortgage/Loan All All All All
10857 2013 67.0 A 5 minute walk or less All All All All Owned Outright All All All All
14161 2013 68.0 A 5 minute walk or less All All All All Private Rented All All All All
14083 2013 64.0 A 5 minute walk or less All All All All Social Rented All All All All
13825 2013 60.0 A 5 minute walk or less All All All All Other All All All All
14717 2014 64.0 A 5 minute walk or less All All All All Social Rented All All All All
14191 2014 68.0 A 5 minute walk or less All All All All Private Rented All All All All
12641 2014 69.0 A 5 minute walk or less All All All All Owned Outright All All All All
20451 2014 72.0 A 5 minute walk or less All All All All Owned Mortgage/Loan All All All All
13879 2014 59.0 A 5 minute walk or less All All All All Other All All All All

The helper function below creates a plot that shows this difference:

In [16]:
greenspace_plotter(survey, 'urban_rural_classification', 'short')

We can see the effect that re-binning the variable has had by creating the same plot with the raw data:

In [17]:
# Average the raw values per year and urban/rural classification
view = (
    distance.query(
        "urban_rural_classification != 'All' "
        "& distance_to_nearest_green_or_blue_space == 'A 5 minute walk or less'"
    )
    .groupby(['datecode', 'urban_rural_classification'], as_index=False)['value']
    .mean()
    .sort_values('datecode')
)

fig = px.line(view, x="datecode", y="value", color='urban_rural_classification',
              title='Percentage of adults living a 5 minute walk or less from green spaces',
              markers=True)
fig.show('notebook')

Are there groups that are lacking access to green spaces?¶

In [18]:
(survey.query("nearest_green_space == 'An 11 minute walk or more' & community_belonging == 'All' & neighbourhood_rating == 'All'")
 .sort_values('percent_adults', ascending=False)
).head(20)
Out[18]:
year percent_adults nearest_green_space age gender urban_rural_classification simd_quintiles type_of_tenure household_type ethnicity community_belonging neighbourhood_rating
40817 2017 22.0 An 11 minute walk or more All All All All All All Other All All
40802 2019 21.0 An 11 minute walk or more All All All All All All Other All All
13947 2014 21.0 An 11 minute walk or more All All All All Other All All All All
13836 2019 20.0 An 11 minute walk or more All All All All Other All All All All
40966 2013 18.0 An 11 minute walk or more All All All All All All Other All All
36486 2017 18.0 An 11 minute walk or more All All All All All Pensioners All All All
40820 2014 18.0 An 11 minute walk or more All All All All All All Other All All
24021 2017 18.0 An 11 minute walk or more All All All 20% most deprived All All All All All
24034 2016 17.0 An 11 minute walk or more All All All 20% most deprived All All All All All
27262 2019 17.0 An 11 minute walk or more 65 years and over All All All All All All All All
27302 2019 17.0 An 11 minute walk or more All All All All All Pensioners All All All
13620 2017 16.0 An 11 minute walk or more All All All All Social Rented All All All All
28968 2013 16.0 An 11 minute walk or more All All All All All Pensioners All All All
22968 2015 16.0 An 11 minute walk or more All All All All All Pensioners All All All
13859 2013 16.0 An 11 minute walk or more All All All All Other All All All All
41002 2015 16.0 An 11 minute walk or more All All All All All All Other All All
40801 2018 16.0 An 11 minute walk or more All All All All All All Other All All
24391 2014 15.0 An 11 minute walk or more All All All 20% most deprived All All All All All
24018 2015 15.0 An 11 minute walk or more All All All 20% most deprived All All All All All
15569 2019 15.0 An 11 minute walk or more All All Rural All All All All All All

Once community and neighbourhood ratings are held constant, we can see that the top characteristics for people who live far from green spaces are: non-white ethnicity, pensioners, lowest SIMD quintile, social rented or other tenure type.

In [19]:
greenspace_plotter(survey, 'ethnicity', 'long')
In [20]:
greenspace_plotter(survey, 'household_type', 'long')

How do people in neighbourhoods with good access to green space differ from those who have no good access?¶

It was noted above that the difference in average values between group types (e.g. rented or owned households) may be more informative than the absolute highest value. To find which variables have the largest spread between their group means, let us calculate the standard deviation of the group means of each variable for people who live an 11 minute walk or further from a green space, beginning with type_of_tenure:

In [21]:
# This calculates the average for each group
view = (survey.query("nearest_green_space == 'An 11 minute walk or more' & type_of_tenure != 'All'")
        .groupby(['type_of_tenure'])
        .mean(numeric_only=True)
        .drop('year', axis=1)
        .sort_values('percent_adults')
)
view
Out[21]:
percent_adults
type_of_tenure
Owned Mortgage/Loan 10.714286
Private Rented 13.000000
Owned Outright 13.428571
Social Rented 13.714286
Other 16.571429
In [22]:
# The standard deviation is:
view.std()
Out[22]:
percent_adults    2.091284
dtype: float64
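As a quick check of this figure, the sketch below recomputes the standard deviation from the five group means printed in Out[21]. Note that pandas uses the sample standard deviation (ddof=1) by default, which is what the value above reflects; the population version (ddof=0) is shown only for comparison.

```python
import pandas as pd

# Group means copied from Out[21]
group_means = pd.Series({
    "Owned Mortgage/Loan": 10.714286,
    "Private Rented": 13.000000,
    "Owned Outright": 13.428571,
    "Social Rented": 13.714286,
    "Other": 16.571429,
})

sample_std = group_means.std()            # pandas default, ddof=1
population_std = group_means.std(ddof=0)  # divides by n instead of n - 1

print(round(sample_std, 6), round(population_std, 6))
```

With only five groups the choice of ddof matters noticeably, so the same convention should be used when comparing variables.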

Now calculate all the standard deviations for possible predictors to identify groups of interest. Note that the custom function just does the above for all predictors.

In [23]:
df_std(survey, 'nearest_green_space', 'An 11 minute walk or more')
Out[23]:
year percent_adults nearest_green_space age gender urban_rural_classification simd_quintiles type_of_tenure household_type ethnicity community_belonging neighbourhood_rating
percent_adults 0.843297 NaN NaN 3.94252 1.313198 0.303046 2.424366 2.091284 2.755329 3.939595 16.640635 24.446819
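The source of df_std is not shown here. The sketch below is one plausible reading of the description above: for each predictor, average percent_adults per group and take the standard deviation of those averages. The function name `std_of_group_means` and the toy rows (a handful of values from the outputs in this notebook) are assumptions, so the results will not match Out[23].

```python
import pandas as pd


def std_of_group_means(df, filter_col, filter_val, predictors, value_col="percent_adults"):
    """Standard deviation of the per-group means, one value per predictor.

    Hypothetical stand-in for illustration; not the actual df_std helper.
    """
    subset = df[df[filter_col] == filter_val]
    result = {}
    for predictor in predictors:
        groups = subset[subset[predictor] != "All"]
        result[predictor] = groups.groupby(predictor)[value_col].mean().std()
    return pd.Series(result)


far = "An 11 minute walk or more"
toy = pd.DataFrame({
    "nearest_green_space": [far] * 6,
    "urban_rural_classification": ["Urban", "Rural", "All", "All", "All", "All"],
    "type_of_tenure": ["All", "All", "Owned Outright", "Owned Outright", "Other", "Other"],
    "percent_adults": [12.0, 15.0, 13.0, 12.0, 21.0, 20.0],
})

spread = std_of_group_means(
    toy,
    "nearest_green_space",
    far,
    ["urban_rural_classification", "type_of_tenure"],
)
print(spread)
```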

Out of all the demographic categories, age, ethnicity, household type, SIMD quintiles and type of tenure seem to explain most of the variation.

In [24]:
greenspace_plotter(survey, 'age', 'long')

The age plot generally matches what we would expect: people aged 65 and over, like pensioners, seem to be the ones further away from green spaces. Given that this data is self-reported, it may be that people lose easy access to green spaces as they become older and the walk takes them longer, rather than their homes actually being further away.

Let us now plot neighbourhood rating and community belonging to see whether there is a relationship between them and distance to a green space:

In [25]:
greenspace_plotter(survey, 'neighbourhood_rating', 'long')
In [26]:
greenspace_plotter(survey, 'community_belonging', 'long')

Neighbourhood and community ratings seem to remain constant across the years. Among adults who live far from green spaces, most consistently report feeling very or fairly positive about their neighbourhoods and communities. Let us make use of another helper function, bar_plotter, to plot these variables relative to each other. The plot shows the average percentage of adults across the years 2013-2019, for all of Scotland:

In [27]:
bar_plotter(survey, 'community_belonging', 'percent_adults', 'nearest_green_space')

We can see that, for those who feel very strongly about their communities, the difference between living more than and less than a 10 minute walk away is 2 percentage points, and the same difference is 0.4 percentage points for those who feel fairly strongly. Similarly, the difference between walking distances is small for those who feel negatively about their communities.

In [28]:
bar_plotter(survey, 'neighbourhood_rating', 'percent_adults', 'nearest_green_space')

Overall, there seems to be little relationship between how people rate their communities and neighbourhoods and how far they live from green spaces.

In [29]:
bar_plotter(survey, 'age', 'percent_adults', 'nearest_green_space')

This plot shows that age has more of a relationship with walking distance to green spaces. Among people aged 65 or over, 8% fewer live within a 5 minute walk and 7% more live further than a 10 minute walk.

In [30]:
bar_plotter(survey, 'urban_rural_classification', 'percent_adults', 'nearest_green_space')

As a final example, we can visualise here the difference between rural and urban households. The difference is 12% for 5 min walking distance.
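The size of this gap can be roughly reproduced from the rows displayed in Out[14]. The sketch below averages across years per classification and takes the difference; it covers only the five displayed years (2013-2017), so it approximates rather than reproduces the 2013-2019 averages in the plot.

```python
import pandas as pd

# Rows displayed in Out[14]: 5 minute walk or less, 2013-2017 only
rows = pd.DataFrame({
    "year": [2013, 2013, 2014, 2014, 2015, 2015, 2016, 2016, 2017, 2017],
    "urban_rural_classification": ["Urban", "Rural"] * 5,
    "percent_adults": [66.0, 75.0, 66.0, 79.0, 64.0, 80.0, 63.0, 75.0, 63.0, 75.0],
})

# Average across years for each classification
averages = rows.groupby("urban_rural_classification")["percent_adults"].mean()
difference = averages["Rural"] - averages["Urban"]

print(averages)
print(round(difference, 1))
```

On these rows the averages are 76.8 for rural and 64.4 for urban households, a difference of about 12 percentage points.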

Summary¶

This analysis focuses on access to green spaces in Scotland. Data comes from the Scottish Household Survey, 2013-2019, and is self-reported.

In terms of the demographics of people who live close to green spaces (defined as being within a 5 minute walk): we conclude that rural households, followed by households with a mortgage, and households with children, score the highest percentages. Variation across the years is generally minor, suggesting that access is not increasing.

In terms of people who live far from green spaces (more than a 10 min walk): households of non-white ethnicity, pensioner/aged 65+ households, and households in the lowest SIMD quintile generally score the highest percentages.

Community belonging and neighbourhood rating show little relationship with green space access, as people seem to consistently report feeling fairly/very positive regardless of their access. The other report in this project focuses on building a model capable of better explaining and predicting these ratings.